In the previous part, we reviewed the basic statistical concepts behind the likelihood inference and the Bayesian inference. We looked at how to write a JAGS model function for some linear and generalized linear regression models and use it in the package ‘dclone’ to get the Bayesian credible intervals as well as the frequentist confidence intervals based on the asymptotic normal distribution. We also discussed some of the reasons to use the MCMC approach to conducting statistical inference, either Bayesian or frequentist.
We will now generalize the models to make them relevant to some complex practical situations. These are some of the situations where the analytical approaches to Bayesian and likelihood inference are difficult to impossible to implement. The question of estimability of the parameters becomes much more relevant but difficult to diagnose. The method of data cloning is particularly useful for diagnosing estimability of the parameters. You will notice that, although the models are much more complex, the coding component does not increase in complexity. We will also discuss prediction of missing data.
Let us revisit the occupancy model again. In practice, the assumption that you observe the occupancy status correctly is somewhat suspect. For example, if we are looking for a bird species, if the bird never sings or gives some sort of a cue, it is extremely difficult to know if they are there. Hence, even if the species is present, we may make an error and note that it is not present. This is called ‘detection error’. How can we model this?
Let \(W_i\) denote the true status of the i-th cell. So in our previous notation, now \(P(W_{i}=1)=\phi\). The observed value, generally denoted by \(Y_i\) could be 1 or 0 depending on the true status. If we assume that the species are never misidentified, then we can write \(P(Y_i=1|W_{i}=1)=p\) and \(P(Y_i=0|W_{i}=1)=1-p\). Moreover, \(P(Y_i=0|W_{i}=0)=1\). In principle, we can also have misidentification. For example, a coyote could be mistaken for a wolf. But we will not discuss it here. It is a fairly simple extension of this model. Probability of detection is \(p\) and probability of occupancy is \(\phi\). How can we infer about these given the data?
Notice that we only observe \(Y_i\)’s and not the \(W_i\)’s. The unobserved variable \(W_i\) is called a latent variable.
To write down the likelihood function, we need to compute the distribution of the observed data \(Y_i\). We can see that \(P(Y_{i}=1)=p \phi\) and \(P(Y_{i}=0)=(1-p) \phi\). We can write down the likelihood based on this. However, we are going to write this as a hierarchical model.
Let us modify the previous code to see how this inference proceeds.
library(dclone)
## Loading required package: coda
## Loading required package: parallel
## Loading required package: Matrix
## dclone 2.3-1 2023-04-07
phi.true = 0.3 # occupancy
p.true = 0.7 # detectability
n = 30 # sample size
W = rbinom(n, 1 ,phi.true) # true status
Y = rbinom(n, 1, W * p.true) # detections
Step 1: WE need to define the model function. This is the critical component.
Occ.model = function(){
# Likelihood: Latent variables are random variables.
for (i in 1:n){
W[i] ~ dbin(phi_occ, 1)
Y[i] ~ dbin(W[i] * p_det, 1)
}
# Priors
phi_occ ~ dbeta(1, 1)
p_det ~ dbeta(1, 1)
}
Now we need to provide the data to the model and generate random numbers from the posterior. We will discuss different options later.
dat = list(Y=Y, n=n)
Following command will not work. But try it by removing the comments hash to see what happens.
Occ.Bayes = jags.fit(data=dat, params=c("phi_occ","p_det"), model=Occ.model)
## Registered S3 method overwritten by 'R2WinBUGS':
## method from
## as.mcmc.list.bugs dclone
## Compiling model graph
## Resolving undeclared variables
## Allocating nodes
## Graph information:
## Observed stochastic nodes: 30
## Unobserved stochastic nodes: 32
## Total graph size: 94
##
## Initializing model
## Deleting model
## Error in rjags::jags.model(model, data, n.chains = n.chains, n.adapt = n.adapt, : Error in node Y[1]
## Node inconsistent with parents
This did not quite work. When there are latent variables, many times, we have to start the process at the appropriate initial values.
ini = list(W=rep(1, n))
Occ.Bayes = jags.fit(data=dat, params=c("phi_occ","p_det"), model = Occ.model,
inits=ini)
## Compiling model graph
## Resolving undeclared variables
## Allocating nodes
## Graph information:
## Observed stochastic nodes: 30
## Unobserved stochastic nodes: 32
## Total graph size: 94
##
## Initializing model
summary(Occ.Bayes)
##
## Iterations = 2001:7000
## Thinning interval = 1
## Number of chains = 3
## Sample size per chain = 5000
##
## 1. Empirical mean and standard deviation for each variable,
## plus standard error of the mean:
##
## Mean SD Naive SE Time-series SE
## p_det 0.5178 0.2213 0.001807 0.008916
## phi_occ 0.5233 0.2234 0.001824 0.010015
##
## 2. Quantiles for each variable:
##
## 2.5% 25% 50% 75% 97.5%
## p_det 0.1780 0.3375 0.4844 0.6848 0.9588
## phi_occ 0.1803 0.3405 0.4890 0.6945 0.9646
plot(Occ.Bayes)
This seems to work quite well! But our answers are quite weird (We know the truth!). Let us plot the two dimensional (joint) distribution of the parameters.
pairs(Occ.Bayes)
This suggests that the posterior distribution is banana shaped. But just looking at these plots, we cannot say for sure if our answers are correct or not.
Should we, then, accept the answers? Not so fast. Let us look at the model again. It is clear that we can estimate the product \(p \phi\) given the data. But decomposing this product in \(p\) and \(\phi\) is impossible. This is called ‘non-estimability’.
In this case, this also is non-identifiability. There are several combinations of \(p\) and \(\phi\) that lead to the same \(p \phi\) and hence the same distribution of the observed data. Such situations are not uncommon when dealing with the hiearchical models in general, and measurement error models in particular.
We should not make any inferences about the probability of occupancy based on these data. You can change the priors and see what happens to the posteriors. You might find it interesting and educational.
Non-estimability: If there are two or more values in the parameter space that lead to identical likelihood value, such values are called ‘non-estimable’.
Note: You may recall from the linear regression that if the covariates are perfectly correlated with each other, the regression coefficients are non-estimable. If covariate \(X_1\) is perfectly correlated with \(X_2\), these covariates separately give no additional information.
If the posterior distribution converges to a non-degenerate distribution as the sample size increases, it implies that set of parameters is non-estimable.
If the posterior distribution converges to a non-degenerate distribution as the number of clones increases, it implies that set of parameters is non-estimable.
An immediate consequence of this result is that the variance of the posterior distribution does not converge to 0 (instead it converges to some positive number).
Let us modify our data cloning code to see what happens.
Occ.model.dc = function(){
# Likelihood
for(k in 1:ncl){
for (i in 1:n){
W[i,k] ~ dbin(phi_occ, 1)
Y[i,k] ~ dbin(W[i,k] * p_det, 1)
}
}
# Prior
phi_occ ~ dbeta(1, 1)
p_det ~ dbeta(1, 1)
}
We need to turn the original data into an array. And we need to add
another index ncl for the cloned dimension. It gets
multiplied by the number of clones.
Y = array(Y, dim=c(n, 1))
Y = dcdim(Y)
dat = list(Y=Y, n=n, ncl=1)
As previously, we need to initiate the W’s.
ini = list(W=array(rep(1, n), dim=c(n, 1)))
We need to clone these initial values as well. You should always check if this is doing the right job.
initfn = function(model, n.clones){
W=array(rep(1, n), dim=c(n, 1))
list(W=dclone(dcdim(W), n.clones))
}
initfn(n.clones=2)
## $W
## clone.1 clone.2
## [1,] 1 1
## [2,] 1 1
## [3,] 1 1
## [4,] 1 1
## [5,] 1 1
## [6,] 1 1
## [7,] 1 1
## [8,] 1 1
## [9,] 1 1
## [10,] 1 1
## [11,] 1 1
## [12,] 1 1
## [13,] 1 1
## [14,] 1 1
## [15,] 1 1
## [16,] 1 1
## [17,] 1 1
## [18,] 1 1
## [19,] 1 1
## [20,] 1 1
## [21,] 1 1
## [22,] 1 1
## [23,] 1 1
## [24,] 1 1
## [25,] 1 1
## [26,] 1 1
## [27,] 1 1
## [28,] 1 1
## [29,] 1 1
## [30,] 1 1
## attr(,"n.clones")
## [1] 2
## attr(,"n.clones")attr(,"method")
## [1] "dim"
## attr(,"n.clones")attr(,"method")attr(,"drop")
## [1] TRUE
Let’s run data cloning.
Occ.DC = dc.fit(data=dat, params=c("phi_occ","p_det"), model=Occ.model.dc,
n.clones=c(1, 2, 5), unchanged="n", multiply="ncl",
inits=ini, initsfun=initfn)
##
## Fitting model with 1 clone
##
## Compiling model graph
## Resolving undeclared variables
## Allocating nodes
## Graph information:
## Observed stochastic nodes: 30
## Unobserved stochastic nodes: 32
## Total graph size: 95
##
## Initializing model
##
##
## Fitting model with 2 clones
##
## Compiling model graph
## Resolving undeclared variables
## Allocating nodes
## Graph information:
## Observed stochastic nodes: 60
## Unobserved stochastic nodes: 62
## Total graph size: 185
##
## Initializing model
##
##
## Fitting model with 5 clones
##
## Compiling model graph
## Resolving undeclared variables
## Allocating nodes
## Graph information:
## Observed stochastic nodes: 150
## Unobserved stochastic nodes: 152
## Total graph size: 455
##
## Initializing model
summary(Occ.DC)
##
## Iterations = 2001:7000
## Thinning interval = 1
## Number of chains = 3
## Sample size per chain = 5000
## Number of clones = 5
##
## 1. Empirical mean and standard deviation for each variable,
## plus standard error of the mean:
##
## Mean SD DC SD Naive SE Time-series SE R hat
## p_det 0.5088 0.2139 0.4783 0.001746 0.01778 1.050
## phi_occ 0.5423 0.2207 0.4935 0.001802 0.02106 1.049
##
## 2. Quantiles for each variable:
##
## 2.5% 25% 50% 75% 97.5%
## p_det 0.2198 0.3297 0.4620 0.6659 0.9543
## phi_occ 0.2252 0.3520 0.5076 0.7095 0.9714
plot(Occ.DC)
pairs(Occ.DC)
There are a couple of diagnostic tools available in ‘dclone’ for the non-estimability issue.
dcdiag(Occ.DC)
## n.clones lambda.max ms.error r.squared r.hat
## 1 1 0.09094702 0.2781267 0.02762617 1.011784
## 2 2 0.08971954 0.1862829 0.01814363 1.010879
## 3 5 0.08953779 0.1470202 0.01975095 1.051241
dcdiag(Occ.DC) |> plot()
lambda.max is converging to zero. The rate
at which this converges to zero should be approximately
1/K.Availability of the estimability diagnostics is one of the most important features of data cloning. It will warn you if your scientific inferences could be misleading.
If the parameters are non-estimable, the only recourse one has is to change the model (add assumptions or collect different kind of data). There is always a possibility that, although the full parameter space may not be estimable, a function of the parameter might be estimable. If such a function is also of scientific interest, we can safely conduct scientific inferences based on estimates of such a function of the parameters.
Suppose we visit the same cell several times. Assume that the visits are independent of each other and the true occupancy status remains the same throughout these surveys, then we can estimate the parameters. The model can be written as a hierarchical model:
Notice that hierarchy 2 depends on hierarchy 1 result.
We can easily modify the earlier code to allow for multiple surveys. We will do such a modification with two surveys for each cell.
phi.true = 0.3
p.true = 0.7
n = 30
v = 2 # number of visits
W = rbinom(n, 1, phi.true)
Y = NULL
for (j in 1:v)
Y <- cbind(Y, rbinom(n, 1, W * p.true))
Step 1: we need to define the model function.
Occ.model = function(){
# Likelihood: Latent variables are random variables.
for (i in 1:n){
W[i] ~ dbin(phi_occ, 1)
for (j in 1:v){
Y[i,j] ~ dbin(W[i] * p_det, 1)}
}
# Priors
phi_occ ~ dbeta(1, 1)
p_det ~ dbeta(1, 1)
}
Now we need to provide the data to the model and generate random numbers from the posterior. We will discuss different options later.
dat = list(Y=Y, n=n, v=v)
ini = list(W=rep(1, n))
Occ.Bayes = jags.fit(data=dat, params=c("phi_occ","p_det"), model=Occ.model,
inits=ini)
## Compiling model graph
## Resolving undeclared variables
## Allocating nodes
## Graph information:
## Observed stochastic nodes: 60
## Unobserved stochastic nodes: 32
## Total graph size: 125
##
## Initializing model
summary(Occ.Bayes)
##
## Iterations = 2001:7000
## Thinning interval = 1
## Number of chains = 3
## Sample size per chain = 5000
##
## 1. Empirical mean and standard deviation for each variable,
## plus standard error of the mean:
##
## Mean SD Naive SE Time-series SE
## p_det 0.4821 0.1888 0.001542 0.005357
## phi_occ 0.2985 0.1562 0.001276 0.005641
##
## 2. Quantiles for each variable:
##
## 2.5% 25% 50% 75% 97.5%
## p_det 0.14546 0.3322 0.4802 0.6242 0.8366
## phi_occ 0.09913 0.1904 0.2620 0.3649 0.7270
plot(Occ.Bayes)
pairs(Occ.Bayes)
These are nice unimodal posterior distributions.
We can modify the data cloning code to check if the parameters are, in fact, estimable.
Occ.model.dc = function(){
# Likelihood
for(k in 1:ncl){
for (i in 1:n){
W[i,k] ~ dbin(phi_occ, 1)
for (j in 1:v){
Y[i,j,k] ~ dbin(W[i,k] * p_det, 1)
}
}
}
# Prior
phi_occ ~ dbeta(1, 1)
p_det ~ dbeta(1, 1)
}
We need to turn the original data into an array. And we need to add
another index ncl for the cloned dimension. It gets
multiplied by the number of clones.
Y = array(Y, dim=c(n,v,1))
Y = dcdim(Y)
dat = list(Y=Y, n=n, v=v, ncl=1)
# As previously, we need to initiate the W's.
ini = list(W=array(rep(1, n), dim=c(n, 1)))
# We need to clone these initial values as well. You should always check if this is doing the right job.
initfn =function(model, n.clones){
W=array(rep(1,n), dim=c(n,1))
list(W=dclone(dcdim(W), n.clones))
}
Occ.DC = dc.fit(data=dat, params=c("phi_occ","p_det"), model=Occ.model.dc,
n.clones=c(1, 2, 5),
unchanged=c("n","v"), multiply="ncl",
inits=ini, initsfun=initfn)
##
## Fitting model with 1 clone
##
## Compiling model graph
## Resolving undeclared variables
## Allocating nodes
## Graph information:
## Observed stochastic nodes: 60
## Unobserved stochastic nodes: 32
## Total graph size: 126
##
## Initializing model
##
##
## Fitting model with 2 clones
##
## Compiling model graph
## Resolving undeclared variables
## Allocating nodes
## Graph information:
## Observed stochastic nodes: 120
## Unobserved stochastic nodes: 62
## Total graph size: 246
##
## Initializing model
##
##
## Fitting model with 5 clones
##
## Compiling model graph
## Resolving undeclared variables
## Allocating nodes
## Graph information:
## Observed stochastic nodes: 300
## Unobserved stochastic nodes: 152
## Total graph size: 606
##
## Initializing model
summary(Occ.DC)
##
## Iterations = 2001:7000
## Thinning interval = 1
## Number of chains = 3
## Sample size per chain = 5000
## Number of clones = 5
##
## 1. Empirical mean and standard deviation for each variable,
## plus standard error of the mean:
##
## Mean SD DC SD Naive SE Time-series SE R hat
## p_det 0.5502 0.09755 0.2181 0.0007965 0.002268 1.004
## phi_occ 0.2202 0.04845 0.1083 0.0003956 0.001163 1.006
##
## 2. Quantiles for each variable:
##
## 2.5% 25% 50% 75% 97.5%
## p_det 0.3518 0.4861 0.5524 0.6184 0.7311
## phi_occ 0.1403 0.1864 0.2155 0.2483 0.3309
plot(Occ.DC)
pairs(Occ.DC)
dcdiag(Occ.DC)
## n.clones lambda.max ms.error r.squared r.hat
## 1 1 0.04883814 0.4917450 0.06417011 1.000623
## 2 2 0.02606550 1.8102381 0.18310070 1.008472
## 3 5 0.01037601 0.1801687 0.03099651 1.004170
plot(dcdiag(Occ.DC))
We have shown how hierarchical models can be used to deal with measurement error. Now we will look at a few examples where we use them to combine data across several studies.
We will start with a simple (but extremely important) example that started the entire field of mixture as well as hierarchical models (Neyman and Scott, 1949).
Researchers in animal husbandary wanted to know how to improve the stock of animals such as milk cows and bulls. This played an important role in the ‘white revolution’ that lead to improving the nutrition in many countries. Following is a somewhat made up and highly simplified situation.
Suppose we have n cows. We want to know which cows have good genetic potential that can be passed on to the next generation. Each cow might have only a few calves. We measure the amount of milk by each calf.
Let \(Y_ij\) be the amount of milk produced by the j-th calf of the i-th cow. We can consider a linear model that uses the ‘cow effect’ (genetic) and ‘environmental effect’ to explain the amount of milk. This is same as one way ANOVA model.
\(Y_{ij}=\mu+\alpha_{i}+\epsilon_{ij}\)
Under the usual Gaussian error structure, we know that
\(Y_{ij}\sim N(\mu_{i},\sigma^{2})\) where \(i=1,2,...,n\) and \(j=1,2\) (\(\mu_{i}=\mu + \alpha_i\)).
There are \(2 n\) observations and \(n+1\) parameters. The number of parameters increases at the same rate as the number of observations. Note that the ratio of parameters to observations converges to 0.5. Roughly speaking, for the MLE to work, this ratio needs to go to 0. Generally the number of parameters is fixed and hence this condition is satisfied.
We simply do not have much information about each \(\mu_i\) as there are only two observations corresponding to it. It also turns out that the ML estimator of \(\sigma^2\) converges to \(0.5*\sigma^2\). Hence it is not consistent even though the number of observations corresponding to it do converge to infinity. This was a major blow to the theory of maximum likelihood. Although it turns out Fisher had implicitly answered it a decade before this paper.
How can we reduce the number of parameters?
This model is a hierarchical model.
For a Bayesian approach, we put priors on the three parameters. This forms the third hierarchy.
This simple model can be used in many different situations.
Let us see what makes it a difficult model to analyze using the likelihood approach.
In order to write down the likelihood function, we need to compute the marginal distribution of the observations. Remember \(\mu_i\) are not observed. Hence we have to integrate over them.
\[ f(y_{ij};\mu,\sigma^{2},\tau^{2})=\int f(y_{ij}|\mu_{i})g(\mu_{i})d\mu_{i} \]
Again, this is not a precise statement but it gives you the idea. This integral is one dimensional and hence can be computed analytically and also numerically. However, in many cases the dimension of the integral is quite large and hence neither of these solutions are available. In some cases, one can obtain Laplace approximation to this integral (INLA and related methods rely on this approximation). The most general approach is based on the MCMC algorithm.
n = 30
mu_true = 2
tau_true = 0.5
sigma_true = 1
mu = rnorm(n, mu_true, tau_true)
Y = cbind(
rnorm(n, mu, sigma_true),
rnorm(n, mu, sigma_true))
We will write the Bayesian model first. Remember the normal distribution is defined in terms of the precision (inverse of the variance).
LM_Bayes = function(){
# Likelihood
for (i in 1:n){
mu[i] ~ dnorm(mu.t, prec_tau)
for (j in 1:2){
Y[i,j] ~ dnorm(mu[i], prec_sigma)
}
}
# Priors
mu.t ~ dnorm(0, 0.01)
prec_tau ~ dgamma(0.1, 0.1)
prec_sigma ~ dgamma(0.1, 0.1)
# Parameters on the natural scale
tau <- sqrt(1/prec_tau)
sigma <- sqrt(1/prec_sigma)
}
Data in array form for data cloning purpose
Y = as.array(Y, dim=c(n,2,1))
Y = dcdim(Y)
dat = list(Y=Y, n=n)
LM_Bayes_fit = jags.fit(data=dat, params=c("mu.t","tau","sigma"), model=LM_Bayes)
## Compiling model graph
## Resolving undeclared variables
## Allocating nodes
## Graph information:
## Observed stochastic nodes: 60
## Unobserved stochastic nodes: 33
## Total graph size: 102
##
## Initializing model
summary(LM_Bayes_fit)
##
## Iterations = 1001:6000
## Thinning interval = 1
## Number of chains = 3
## Sample size per chain = 5000
##
## 1. Empirical mean and standard deviation for each variable,
## plus standard error of the mean:
##
## Mean SD Naive SE Time-series SE
## mu.t 1.9420 0.1459 0.0011911 0.002758
## sigma 0.9263 0.1045 0.0008532 0.001651
## tau 0.4582 0.1510 0.0012328 0.004369
##
## 2. Quantiles for each variable:
##
## 2.5% 25% 50% 75% 97.5%
## mu.t 1.6526 1.8453 1.9415 2.0390 2.2286
## sigma 0.7391 0.8524 0.9215 0.9924 1.1451
## tau 0.2154 0.3451 0.4436 0.5537 0.7931
We will modify it to do data cloning. This is useful to assure us that the parameters are estimable. It is also important for making it invariant to the choice of the priors and parameterization.
LM_DC = function(){
# Likelihood
for (k in 1:ncl){
for (i in 1:n){
mu[i,k] ~ dnorm(mu.t, prec_tau)
for (j in 1:2){
Y[i,j,k] ~ dnorm(mu[i,k], prec_sigma)
}
}
}
# Priors
mu.t ~ dnorm(0, 0.01)
prec_tau ~ dgamma(0.1, 0.1)
prec_sigma ~ dgamma(0.1, 0.1)
# Parameters on the natural scale
tau <- sqrt(1/prec_tau)
sigma <- sqrt(1/prec_sigma)
}
Data in array form for data cloning purpose.
Y = array(Y, dim=c(n,2,1))
Y = dcdim(Y)
dat = list(Y=Y, n=n, ncl=1)
LM_DC_fit = dc.fit(data=dat, params=c("mu.t","tau","sigma"), model=LM_DC,
n.clones=c(1, 2, 5), unchanged="n", multiply="ncl")
##
## Fitting model with 1 clone
##
## Compiling model graph
## Resolving undeclared variables
## Allocating nodes
## Graph information:
## Observed stochastic nodes: 60
## Unobserved stochastic nodes: 33
## Total graph size: 103
##
## Initializing model
##
##
## Fitting model with 2 clones
##
## Compiling model graph
## Resolving undeclared variables
## Allocating nodes
## Graph information:
## Observed stochastic nodes: 120
## Unobserved stochastic nodes: 63
## Total graph size: 193
##
## Initializing model
##
##
## Fitting model with 5 clones
##
## Compiling model graph
## Resolving undeclared variables
## Allocating nodes
## Graph information:
## Observed stochastic nodes: 300
## Unobserved stochastic nodes: 153
## Total graph size: 463
##
## Initializing model
summary(LM_DC_fit)
##
## Iterations = 1001:6000
## Thinning interval = 1
## Number of chains = 3
## Sample size per chain = 5000
## Number of clones = 5
##
## 1. Empirical mean and standard deviation for each variable,
## plus standard error of the mean:
##
## Mean SD DC SD Naive SE Time-series SE R hat
## mu.t 1.9383 0.06202 0.1387 0.0005064 0.001477 1.003
## sigma 0.9289 0.04916 0.1099 0.0004014 0.001209 1.001
## tau 0.3739 0.08948 0.2001 0.0007306 0.004310 1.004
##
## 2. Quantiles for each variable:
##
## 2.5% 25% 50% 75% 97.5%
## mu.t 1.8200 1.8969 1.9377 1.9799 2.0627
## sigma 0.8343 0.8950 0.9281 0.9615 1.0278
## tau 0.2083 0.3104 0.3710 0.4349 0.5527
pairs(LM_DC_fit)
dcdiag(LM_DC_fit)
## n.clones lambda.max ms.error r.squared r.hat
## 1 1 0.02423940 0.21841807 0.022529962 1.003789
## 2 2 0.01521459 0.07551335 0.010411112 1.003720
## 3 5 0.00871944 0.01840762 0.002994629 1.004627
plot(dcdiag(LM_DC_fit))
The best thing about the MCMC approach is that we can modify the prototype program to do Generalized linear mixed models.
Let us see how we can change the program to do Poisson regression with random intercepts. This is useful for accounting for missing covariates in the usual Poisson regression. This can also be used to account for site effect in abundance surveys.
Mathematically the model is:
n = 30
mu_true = 2
tau_true = 0.5
sigma_true = 1
mu = rnorm(n, mu_true, tau_true)
Y = rpois(n, exp(mu))
We will write the Bayesian model first. Remember the normal distribution is defined in terms of the precision (inverse of the variance).
GLMM_Bayes = function(){
# Likelihood
for (i in 1:n){
mu[i] ~ dnorm(mu.t, prec_tau)
Y[i] ~ dpois(exp(mu[i]))
}
# Priors
mu.t ~ dnorm(0, 0.01)
prec_tau ~ dgamma(0.1, 0.1)
# Parameters on the natural scale
tau <- sqrt(1/prec_tau)
lambda <- exp(mu.t)
}
dat = list(Y=Y, n=n)
GLMM_Bayes_fit = jags.fit(data=dat, params=c("lambda","tau"), model=GLMM_Bayes)
## Compiling model graph
## Resolving undeclared variables
## Allocating nodes
## Graph information:
## Observed stochastic nodes: 30
## Unobserved stochastic nodes: 32
## Total graph size: 100
##
## Initializing model
summary(GLMM_Bayes_fit)
##
## Iterations = 2001:7000
## Thinning interval = 1
## Number of chains = 3
## Sample size per chain = 5000
##
## 1. Empirical mean and standard deviation for each variable,
## plus standard error of the mean:
##
## Mean SD Naive SE Time-series SE
## lambda 7.0900 0.8696 0.0071004 0.011371
## tau 0.5348 0.1084 0.0008854 0.001917
##
## 2. Quantiles for each variable:
##
## 2.5% 25% 50% 75% 97.5%
## lambda 5.4744 6.4896 7.0582 7.6512 8.8903
## tau 0.3522 0.4579 0.5243 0.5996 0.7745
As usual, we can turn the crank for data cloning to check the estimability of the parameters.
GLMM_DC = function(){
# Likelihood
for (k in 1:ncl){
for (i in 1:n){
mu[i,k] ~ dnorm(mu.t,prec_tau)
Y[i,k] ~ dpois(exp(mu[i,k]))
}
}
# Priors
mu.t ~ dnorm(0, 0.01)
prec_tau ~ dgamma(0.1, 0.1)
# Parameters on the natural scale
tau <- sqrt(1/prec_tau)
lambda <- exp(mu.t)
}
# Data in array form for data cloning purpose
Y = array(Y, dim=c(n,1))
Y = dcdim(Y)
dat = list(Y=Y, n=n, ncl=1)
GLMM_DC_fit = dc.fit(data=dat, params=c("lambda","tau"), model=GLMM_DC,
n.clones=c(1, 2, 5), unchanged="n", multiply="ncl")
##
## Fitting model with 1 clone
##
## Compiling model graph
## Resolving undeclared variables
## Allocating nodes
## Graph information:
## Observed stochastic nodes: 30
## Unobserved stochastic nodes: 32
## Total graph size: 101
##
## Initializing model
##
##
## Fitting model with 2 clones
##
## Compiling model graph
## Resolving undeclared variables
## Allocating nodes
## Graph information:
## Observed stochastic nodes: 60
## Unobserved stochastic nodes: 62
## Total graph size: 191
##
## Initializing model
##
##
## Fitting model with 5 clones
##
## Compiling model graph
## Resolving undeclared variables
## Allocating nodes
## Graph information:
## Observed stochastic nodes: 150
## Unobserved stochastic nodes: 152
## Total graph size: 461
##
## Initializing model
summary(GLMM_DC_fit)
##
## Iterations = 2001:7000
## Thinning interval = 1
## Number of chains = 3
## Sample size per chain = 5000
## Number of clones = 5
##
## 1. Empirical mean and standard deviation for each variable,
## plus standard error of the mean:
##
## Mean SD DC SD Naive SE Time-series SE R hat
## lambda 7.0895 0.37503 0.8386 0.003062 0.0052311 1.001
## tau 0.5079 0.04581 0.1024 0.000374 0.0008626 1.000
##
## 2. Quantiles for each variable:
##
## 2.5% 25% 50% 75% 97.5%
## lambda 6.3718 6.832 7.0865 7.3438 7.830
## tau 0.4247 0.476 0.5057 0.5374 0.603
exp(mu_true)
## [1] 7.389056
tau_true
## [1] 0.5
In clinical trials, we are interested in estimating the effect of the treatment. One of the simplest forms of clinical trial is where we split a group of patients in two groups randomly. One of the group gets the treatment and the other gets a placebo.
We can then estimate the difference in the outcomes. This may be done using a simple t-test if the outcome is a continuous measurement. If the patients are quite different from each other in terms of say age, blood pressure or some such physical characteristics that may affect the outcome, we adjust them by using a regression approach. We include these other covariates and the treatment/control indicator variable in the model. The effect of the treatment after adjusting for other covariates can be studied using such a regression model.
Often in practice, we may not have access to such covariates. Using random intercept model in regression is one way out of such a situation. In this case, we do consider the differences between patients but without ascribing them to any specific, known values of the covariates. The model one may consider is:
\[ Y_{i}=\beta_{0i}+\beta I_{(Treatment)}+\epsilon_{i} \]
As we have seen in the previous example (Neyman-Scott problem), this model is non-estimable. One way to make it estimable is by using a hierarchical structure:
Hierarchy 2: \(\beta_{0i}\sim N(\beta_{0},\tau^{2})\)
Homework: Check the validity of the following statement without doing any mathematics. You can use data cloning to do that.
This leads to estimability for \(\beta\), the parameter of interest. Although it does not lead to estimation of the variances \((\tau,\sigma)\).
An approach not described in this course: We can use profile likelihood for the parameter \(\beta\). This eliminates the ‘nuisance parameters’ \(\beta_0,\tau,\sigma\). Computing the profile likelihood and quantification of uncertainty for inferences based on it for hierarchical models can be tackled using data cloning. This could be another course!
Suppose the outcome is binary, survival for 5 years vs. failure before 5 years. In this case, a convenient model is a binary regression model such as a Logistic regression model.
Hierarchy 1:
\[ P(Y_{i}=1)=\frac{exp(\beta_{0i}+\beta I_{(Treatment)})}{1+exp(\beta_{0i}+\beta I_{(Treatment)})} \]
Hierarchy 2: \(\beta_{0i}\sim N(\beta_{0},\tau^{2})\)
This is an example of a Generalized Linear Mixed Model. Fortunately, for this model all parameters are estimable as we will see using data cloning. The random intercept here could be accounting for differences in the clinical centers (hospitals), assuming we have only two patients, one in control and one in the treatment group. If we have multiple patients in each group, we may include another random effect to account for differences in the patients within each group. For example, we may consider a model:
Hierarchy 1:
\[ P(Y_{ij}=1)=\frac{exp(\beta_{0i}+\beta I_{(Treatment)}+\beta_{1j})}{1+exp(\beta_{0i}+\beta I_{(Treatment)}+\beta_{1j})} \]
Hierarchy 2: \(\beta_{0i}\sim N(\beta_{0},\tau^{2})\), \(\beta_{1j}\sim N(\beta_{1},\tau^{2})\)
We can include interactions between random effects and so on. We will not go into these complex models in this course.
Let us see how one can modify the prototype program to analyze the random intercept Logistic regression model.
Each center has one control and one treatment patient.
n = 300 # Number of clinical centers
mu_true = 0
tau_true = 0.5
delta = 1 # Treatment effect
mu = rnorm(n, mu_true, tau_true)
Y = cbind(
rbinom(n, 1, plogis(mu)),
rbinom(n, 1, plogis(mu+delta)))
We will write the Bayesian model first. Remember the normal distribution is defined in terms of the precision (inverse of the variance).
GLMM_Bayes = function(){
# Likelihood
for (i in 1:n){
mu[i] ~ dnorm(mu.t,prec_tau)
Y[i,1] ~ dbin(ilogit(mu[i]),1)
Y[i,2] ~ dbin(ilogit(mu[i]+delta),1)
}
# Priors
mu.t ~ dnorm(0, 0.01)
prec_tau ~ dgamma(0.1, 0.1)
delta ~ dnorm(0, 0.1)
# Parameters on the natural scale
tau <- sqrt(1/prec_tau)
}
dat = list(Y=Y, n=n)
GLMM_Bayes_fit = jags.fit(data=dat, params=c("delta","mu.t","tau"), model=GLMM_Bayes)
## Compiling model graph
## Resolving undeclared variables
## Allocating nodes
## Graph information:
## Observed stochastic nodes: 600
## Unobserved stochastic nodes: 303
## Total graph size: 1810
##
## Initializing model
summary(GLMM_Bayes_fit)
##
## Iterations = 2001:7000
## Thinning interval = 1
## Number of chains = 3
## Sample size per chain = 5000
##
## 1. Empirical mean and standard deviation for each variable,
## plus standard error of the mean:
##
## Mean SD Naive SE Time-series SE
## delta 1.02466 0.1909 0.001558 0.007243
## mu.t 0.07583 0.1236 0.001009 0.005781
## tau 0.60057 0.2446 0.001997 0.030709
##
## 2. Quantiles for each variable:
##
## 2.5% 25% 50% 75% 97.5%
## delta 0.6580 0.89500 1.02236 1.1513 1.4090
## mu.t -0.1648 -0.01046 0.07614 0.1594 0.3171
## tau 0.2387 0.40407 0.57343 0.7732 1.1100
We will modify this to get the MLE using data cloning.
GLMM_DC = function(){
# Likelihood
for (k in 1:ncl){
for (i in 1:n){
mu[i,k] ~ dnorm(mu.t,prec_tau)
Y[i,1,k] ~ dbin(ilogit(mu[i,k]),1)
Y[i,2,k] ~ dbin(ilogit(mu[i,k]+delta),1)
}
}
# Priors
mu.t ~ dnorm(0, 0.01)
prec_tau ~ dgamma(0.1, 0.1)
delta ~ dnorm(0, 0.1)
# Parameters on the natural scale
tau <- sqrt(1/prec_tau)
}
Y = array(Y, dim=c(dim(Y),1))
Y = dcdim(Y)
dat = list(Y=Y, n=n, ncl=1)
GLMM_DC_fit = dc.fit(data=dat, params=c("delta","mu.t","tau"), model=GLMM_DC,
n.clones=c(1, 2, 5), unchanged="n", multiply="ncl")
##
## Fitting model with 1 clone
##
## Compiling model graph
## Resolving undeclared variables
## Allocating nodes
## Graph information:
## Observed stochastic nodes: 600
## Unobserved stochastic nodes: 303
## Total graph size: 1811
##
## Initializing model
##
##
## Fitting model with 2 clones
##
## Compiling model graph
## Resolving undeclared variables
## Allocating nodes
## Graph information:
## Observed stochastic nodes: 1200
## Unobserved stochastic nodes: 603
## Total graph size: 3611
##
## Initializing model
##
##
## Fitting model with 5 clones
##
## Compiling model graph
## Resolving undeclared variables
## Allocating nodes
## Graph information:
## Observed stochastic nodes: 3000
## Unobserved stochastic nodes: 1503
## Total graph size: 9011
##
## Initializing model
summary(GLMM_DC_fit)
##
## Iterations = 2001:7000
## Thinning interval = 1
## Number of chains = 3
## Sample size per chain = 5000
## Number of clones = 5
##
## 1. Empirical mean and standard deviation for each variable,
## plus standard error of the mean:
##
## Mean SD DC SD Naive SE Time-series SE R hat
## delta 1.04333 0.08504 0.1902 0.0006943 0.003189 1.001
## mu.t 0.05618 0.05498 0.1229 0.0004489 0.002438 1.003
## tau 0.62409 0.11771 0.2632 0.0009611 0.017587 1.012
##
## 2. Quantiles for each variable:
##
## 2.5% 25% 50% 75% 97.5%
## delta 0.88065 0.98524 1.04297 1.10071 1.2106
## mu.t -0.05026 0.01822 0.05712 0.09293 0.1632
## tau 0.41675 0.53636 0.61734 0.70651 0.8597
pairs(GLMM_DC_fit)
dcdiag(GLMM_DC_fit)
## n.clones lambda.max ms.error r.squared r.hat
## 1 1 0.05043076 0.03286364 0.004560375 1.026371
## 2 2 0.05145537 0.02158761 0.002966015 1.205762
## 3 5 0.01522582 0.01916448 0.001221529 1.005063
plot(dcdiag(GLMM_DC_fit))
Following is a real data set on multi-center clinical trials. Number
of patients in the treatment are "nt" and those who
survived are "rt". Similarly "nc" are the
number of patients in the control group and "rc" are the
ones that survived.
OR.data = list(
"rt" =
c(3, 7, 5, 102, 28, 4, 98, 60, 25, 138, 64, 45, 9, 57, 25, 33,
28, 8, 6, 32, 27, 22),
"nt" =
c(38, 114, 69, 1533, 355, 59, 945, 632, 278, 1916, 873, 263,
291, 858, 154, 207, 251, 151, 174, 209, 391, 680),
"rc" =
c(3, 14, 11, 127, 27, 6, 152, 48, 37, 188, 52, 47, 16, 45, 31,
38, 12, 6, 3, 40, 43, 39),
"nc" =
c(39, 116, 93, 1520, 365, 52, 939, 471, 282, 1921, 583, 266,
293, 883, 147, 213, 122, 154, 134, 218, 364, 674),
"Num" =
22)
The model functions for DD are as follows.
DC.MLE.fn = function() {
for (k in 1:ncl){
for (i in 1:Num) {
rt[i,k] ~ dbin(pt[i,k], nt[i,k])
rc[i,k] ~ dbin(pc[i,k], nc[i,k])
logit(pc[i,k]) <- mu[i,k]
logit(pt[i,k]) <- mu[i,k] + delta
mu[i,k] ~ dnorm(alpha,tau1)
}
}
# Priors
parms ~ dmnorm(MuP, PrecP)
delta <- parms[1]
tau1 <- exp(parms[3])
sigma1 <- 1/sqrt(tau1)
alpha <- parms[2]
}
# Analysis
rt = dcdim(array(OR.data$rt, dim=c(22,1)))
nt = dcdim(array(OR.data$nt, dim=c(22,1)))
rc = dcdim(array(OR.data$rc, dim=c(22,1)))
nc = dcdim(array(OR.data$nt, dim=c(22,1)))
Num = OR.data$Num
dat = list(rt=rt, rc=rc, nt=nt, nc=nc, Num=Num, ncl=1,
MuP=rep(0, 3), PrecP=diag(0.01, 3, 3))
DC.MLE = dc.fit(data=dat, params="parms", model=DC.MLE.fn,
n.clones=c(1, 4, 16),
multiply="ncl", unchanged=c("Num","MuP","PrecP"),
n.chains=5, n.update=1000, n.iter=5000, n.adapt=2000)
##
## Fitting model with 1 clone
##
## Compiling model graph
## Resolving undeclared variables
## Allocating nodes
## Graph information:
## Observed stochastic nodes: 44
## Unobserved stochastic nodes: 23
## Total graph size: 200
##
## Initializing model
##
##
## Fitting model with 4 clones
##
## Compiling model graph
## Resolving undeclared variables
## Allocating nodes
## Graph information:
## Observed stochastic nodes: 176
## Unobserved stochastic nodes: 89
## Total graph size: 728
##
## Initializing model
##
##
## Fitting model with 16 clones
##
## Compiling model graph
## Resolving undeclared variables
## Allocating nodes
## Graph information:
## Observed stochastic nodes: 704
## Unobserved stochastic nodes: 353
## Total graph size: 2840
##
## Initializing model
# Check the convergence and MLE estimates etc.
summary(DC.MLE)
##
## Iterations = 3001:8000
## Thinning interval = 1
## Number of chains = 5
## Sample size per chain = 5000
## Number of clones = 16
##
## 1. Empirical mean and standard deviation for each variable,
## plus standard error of the mean:
##
## Mean SD DC SD Naive SE Time-series SE R hat
## parms[1] -0.1961 0.01251 0.05005 7.914e-05 0.0002754 1.004
## parms[2] -2.2609 0.02977 0.11908 1.883e-04 0.0008587 1.005
## parms[3] 1.3586 0.08800 0.35199 5.565e-04 0.0062806 1.019
##
## 2. Quantiles for each variable:
##
## 2.5% 25% 50% 75% 97.5%
## parms[1] -0.2206 -0.2047 -0.1961 -0.1876 -0.1715
## parms[2] -2.3173 -2.2810 -2.2607 -2.2411 -2.2025
## parms[3] 1.1753 1.3009 1.3611 1.4199 1.5227
dctable(DC.MLE)
## $`parms[1]`
## n.clones mean sd 2.5% 25% 50% 75%
## 1 1 -0.1942684 0.04861853 -0.2898628 -0.2258169 -0.1944701 -0.1616167
## 2 4 -0.1946009 0.02537714 -0.2461245 -0.2114561 -0.1940935 -0.1767720
## 3 16 -0.1960873 0.01251312 -0.2205589 -0.2047403 -0.1961376 -0.1876047
## 97.5% r.hat
## 1 -0.09774332 1.000767
## 2 -0.14539702 1.002668
## 3 -0.17146654 1.004162
##
## $`parms[2]`
## n.clones mean sd 2.5% 25% 50% 75%
## 1 1 -2.267148 0.12436319 -2.514560 -2.345987 -2.268035 -2.187160
## 2 4 -2.261876 0.05881435 -2.377168 -2.301915 -2.262521 -2.221238
## 3 16 -2.260867 0.02977115 -2.317269 -2.280967 -2.260705 -2.241095
## 97.5% r.hat
## 1 -2.028945 1.001747
## 2 -2.146913 1.003795
## 3 -2.202538 1.005013
##
## $`parms[3]`
## n.clones mean sd 2.5% 25% 50% 75% 97.5%
## 1 1 1.269417 0.36259431 0.5486396 1.018522 1.293586 1.516854 1.960875
## 2 4 1.353340 0.17078107 1.0294786 1.236529 1.351012 1.471465 1.688554
## 3 16 1.358610 0.08799719 1.1753083 1.300884 1.361063 1.419934 1.522684
## r.hat
## 1 1.006177
## 2 1.023020
## 3 1.019407
##
## attr(,"class")
## [1] "dctable"
dcdiag(DC.MLE)
## n.clones lambda.max ms.error r.squared r.hat
## 1 1 0.131503949 0.03284478 0.002697220 1.005085
## 2 4 0.029167745 0.01333623 0.001603985 1.022289
## 3 16 0.007743858 0.01355314 0.002209983 1.020749
It should be clear by now that we can modify any Bayesian analysis to get Maximum likelihood estimate quite easily by adding one dimension to the data and a do loop over the clones.
In the next part, we will discuss how to analyzed time series data.